Reinforcement learning evaluation of treatment policies for patients with hepatitis C virus

Background Evaluation of new treatment policies is often costly and challenging in complex conditions, such as hepatitis C virus (HCV) treatment, or in limited-resource settings. We sought to identify hypothetical policies for HCV treatment that could best balance the prevention of cirrhosis while preserving resources (financial or otherwise). Methods The cohort consisted of 3792 HCV-infected patients without a history of cirrhosis or hepatocellular carcinoma at baseline from the national Veterans Health Administration from 2015 to 2019. To estimate the efficacy of hypothetical treatment policies, we utilized historical data and reinforcement learning to allow for greater flexibility when constructing new HCV treatment strategies. We tested and compared four new treatment policies: a simple stepwise policy based on Aspartate Aminotransferase to Platelet Ratio Index (APRI), a logistic regression based on APRI, a logistic regression on multiple longitudinal and demographic indicators that were prespecified for clinical significance, and a treatment policy based on a risk model developed for HCV infection. Results The risk-based hypothetical treatment policy achieved the lowest overall risk with a score of 0.016 (90% CI 0.016, 0.019) while treating the most high-risk (346.4 ± 1.4) and the fewest low-risk (361.0 ± 20.1) patients. Compared to hypothetical treatment policies that treated approximately the same number of patients (1843.7 vs. 1914.4 patients), the risk-based policy had more untreated time per patient (7968.4 vs. 7742.9 patient visits), signaling cost reduction for the healthcare system. Conclusions Off-policy evaluation strategies are useful to evaluate hypothetical treatment policies without implementation. If a quality risk model is available, risk-based treatment strategies can reduce overall risk and prioritize patients while reducing healthcare system costs. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-022-01789-7.


Background
Health system stressors occur in multiple medical contexts and can be exacerbated by limited resources (e.g., limited capacity or budget). This issue has become particularly apparent during the COVID-19 pandemic when resources such as personal protective equipment, intensive care unit capacity, and treatments have been in high demand but under limited supply [1][2][3]. During these situations, health systems must determine the most efficient and effective way to allocate scarce resources within the constraints of their health system, often to large populations of patients [3]. Some health systems take a first-come-first-serve approach, whereas Page 2 of 10 Oselio et al. BMC Medical Informatics and Decision Making (2022) 22:63 others prioritize patients according to their risk of disease progression or complications. Allocation of care fairly and equitably is essential, particularly considering historic inequities with high barriers to care and worse health outcomes among racial and ethnic minorities and patients of low socioeconomic status. These policies have far-reaching effects and must be guided by the medical ethics [3][4][5][6]. Unfortunately, it is difficult to evaluate various approaches until after implementation. Although simulation modeling can be used to evaluate theoretical decision paths in advance, they are based on assumptions that may fail to provide unbiased evaluations of the hypothetical treatment policy [7]. These issues highlight a need for better methods to evaluate proposed policies before clinical implementation. Herein, we used hepatitis C virus (HCV) treatment with direct-acting antivirals (DAA) as a case study on which to develop a reinforcement learning approach to evaluate proposed treatment policies before implementation using historical data. Reinforcement learning provides a framework to utilize data that is longitudinal in nature and contains feedback from decisions made over time, such as assessment of a patient's health status and decision to start treatment and thus evaluate new treatment strategies or policies in medicine [8][9][10][11][12][13][14]. HCV is a valuable case study as it has traditionally been one of the most common risk factors for cirrhosis and liver-related mortality in the United States and Europe. The availability of DAA therapies in 2015 offered a cure for HCV and has helped mitigate HCV-related morbidity and mortality over the last decade. If patients are treated early before the onset of cirrhosis, HCV therapy can halt disease progression and significantly reduce the risk of cirrhosis and liver-related mortality. Patients treated after the onset of cirrhosis have improved quality of life and prognosis. However, they have a persistent risk of hepatocellular carcinoma and liver-related complications, warranting continued observation.
Despite the documented benefits of early treatment of HCV, the initial high cost of DAAs resulted in many payers limiting access to treatment. Some of these policies included restricted access to DAAs based on (1) fibrosis stage (e.g., presence of significant fibrosis or cirrhosis); (2) sobriety from alcohol and illicit substances; or (3) prescriber specialty. This was particularly true for Medicaid plans, which are responsible for covering low socioeconomic status groups and often racial and ethnic minority communities, thus exacerbating disparities in access and care-violating the ethical principle in equal access to care [4]. These policies led to institutional, geographic, and temporal variation in HCV treatment policies, including who is eligible and/or prioritized for treatment, thus creating historical treatment data and a natural case study to develop an approach to evaluate proposed treatment policies before implementation.
The purpose of this study was to evaluate hypothetical policies for HCV treatment before clinical implementation using techniques from reinforcement learning, leveraging historical data collected under an existing treatment policy. With this historical data, we can evaluate new hypothetical treatment policies under the paradigm of reinforcement learning by comparing them with the resulting rewards (lower risk) of the existing treatment policy. In this setting, we wish to evaluate hypothetical treatment policies that could potentially replace the standard-of-care policy. With HCV as our case study, we used a previously published risk prediction model [15] to measure a patient's risk over time. This published model was the basis for patients' risk estimates and the hypothetical risk-based treatment policy. This risk estimates combined with the treatment decisions made for each patient and associated longitudinal and demographic variables allowed us to compare additional hypothetical treatment policies for HCV treatment allocation.

Data collection and study population
The cohort was collected from the Veterans Health Administration (VHA) Corporate Data Warehouse, an electronic repository of clinical and demographic data for Veterans served by the VHA health care system. All patients with a history of HCV (defined by the presence of at least one positive HCV RNA during the study period) were identified from January 2015 to January 2016, with follow-up through 2019. The original cohort study was obtained from a previous publication whose original study date was from January 2000 to January 2016 and had the following exclusions: patients with less than two AST-to-platelet ratio index (APRI) scores, patients with a history of cirrhosis or hepatocellular carcinoma at baseline, and those with baseline APRI > 2.0, and finally excluded those patients who received antiviral treatment regimens but lacked RNA tests to document whether sustained virologic response (SVR) was achieved. The resulting dataset consisted of 169,339 patients. From this dataset and given that the focus of our study was DAA receipt, we excluded patients seen before January 2015 (n = 164,835) and those that were only treated with older non-DAA interferon-based regimens (n = 34). We also excluded patients that were treated with DAAs but did not achieve sustained virologic response (SVR, i.e., HCV cure) (n = 397); and patients that needed to be treated with more than one antiviral regimen (n = 138); finally, we excluded any patients whose treatment occurred before study enrollment (n = 143). This led to 3792 patients in the final dataset.

Study variables
Predictors of interest were selected a priori based on prior work, [15,16] biological plausibility [15][16][17], and clinician input. Demographic variables included age at cohort entry, sex, race, and ethnicity. SVR, i.e., HCV cure, was modeled as a step function of time whereby the variable remains 0 until SVR is achieved, at which point it becomes 1. Laboratory variables included aspartate aminotransferase (AST) ratio, alanine aminotransferase (ALT) ratio, AST/ALT ratio, albumin, total bilirubin, creatinine, blood urea nitrogen, glucose, hemoglobin, platelet count, white blood cell count, sodium, potassium, and chloride. We used all available laboratory measurements for each patient. Patients with more than one measurement for a particular variable on a single day were averaged for that day. For each patient, treatment information is also available (if they were treated), including the type of drug, treatment date, and treatment length. This allows us to create longitudinal patient trajectories and treatment decisions over time.

Reinforcement learning algorithm approach for treatment policies
Reinforcement learning is an area of machine learning that studies how actions are taken over time affect current and downstream outcomes [8][9][10][11][12][13][14]18]. An example is a physician who is considering different treatment options for a specific condition, which may require updating based on a patient's response and any adverse events. Reinforcement learning can help determine a policy of action best for patients on average. This differs from traditional machine learning, where temporality is not taken into effect, and decisions cannot vary over time.
Reinforcement learning can be used to both evaluate sequential decision-making and identify and evaluate new policies in medicine. In the original cohort data ( Fig. 1), treatment decisions are made in a sequential context: the patient is evaluated, and information, or the patient's state, is collected. The clinician then evaluates this state and makes a treatment decision. This cycle then restarts as the patient's state is measured again, and the clinician updates the treatment decision based on this new information. The process continues for a series of time points until the patient does not return for more visits or the end of the follow-up period is reached. In this setting, we wish to evaluate hypothetical treatment policies that could potentially replace the standardof-care policy; ideally, a hypothetical treatment policy would have a lower overall risk to patients. Reinforcement learning offers a framework to encode this process quantitatively.
In general, this requires a multi-step approach: (1) A feature representation is created for the states, actions, and risks for each patient in the dataset, (2) an offpolicy evaluation method is constructed as a way to compare hypothetical treatment policies, and (3) hypothetical clinical policies are tested utilizing the off-policy evaluation method.

Feature Representation for States, Actions, and Risks
A set of longitudinal states, actions, and risks is derived from the cohort data for each patient. A measurement time point for each patient is defined as a day of record for the patient in the cohort data, i.e., if the patient has a recorded longitudinal measurement on that day.
States: The state for each patient at each time point is a vector of the following variables: sodium, creatinine, chloride, total protein count, alkaline phosphatase, APRI, potassium, glucose, platelet count, AST ratio, INR, white blood count, bilirubin, albumin, ALT ratio, AST/ALT ratio, FIB4 score, SVR status, demographic group (Hispanic, White, or Other), and sex. For the demographic information and sex, the variable is coded as 1 for True and 0 for False (demographic is split into three separate variables). Those values are constant over time for the patient. For missing values, the last known value carried forward is used. When a  (1), whose state, i.e., longitudinal and demographic information is measured (2). Given these measurements, the risk to the patient progressing to cirrhosis is then evaluated (3). Finally, following the usual care treatment policy (4), a clinician makes a treatment decision (5) for the patient. The cycle then continues until the patient no longer returns for follow-up or the follow-up period concludes measurement is missing for the patient at all measurement time points, the overall median of that variable across all patients is used. Actions: For the HCV cohort, we consider a binary treatment decision. The action variable at each time point is 1 if the treatment is currently occurring. If there is no measurement time point on the first day of treatment, we define the action at the previous measurement time point as 1. This is done so that the time at which a treatment decision is made is correctly recorded, i.e., during a defined measurement time point.
Risks: The treatment decision under consideration is whether or not to treat with a DAA regimen. The patient's risk is defined as the estimated risk of cirrhosis after 1-year. We calculated patient risk at each measurement time point using a predefined model from Beste et al. [15], which used a multivariate time-varying Cox process to predict the risk of cirrhosis within 1-year from the current time point and was specifically adapted to use longitudinal lab data and to account for treatment and SVR status. The coefficients of the time-varying Cox Model were previously published in Beste et al. (2020) and found in Additional file 1: Table S1 [15].

Off-Policy Evaluation
Off-policy evaluation in this paper is based on the reinforcement learning framework to evaluate the hypothetical scenarios rather than learn a new policy using historical data [19][20][21][22]. To adequately describe this technique, we first introduce some notation. As described in the previous section, we obtain for each patient a sequence of triplets consisting of actionsa i , statess i , and risksr i , H = [(a 1 , s 1 , r 1 ), . . . (a T , s T , r T )] . We mathematically define a treatment policy π as a probability distribution over the possible action space, given a particular state, so that in our case π (1|s) is the probability of treatment with a DAA given the patient states, and π (0|s) = 1 − π(1|s) . Given this treatment policy π , we are interested in estimating the overall risk to patients over time: where γ ∈ [0, 1) is a discount parameter that encodes how much we care about past risk as opposed to current risk of the patient, and r t are the risks observed when implementing the treatment policy π at time t. If enough data is collected under the policy of interest, then it is straightforward to estimate Eq. 1 by simply using the empirical weighted average of the resulting risks for each patient. However, as we wish to use historical cohort data to evaluate π, we resort to a different approach. In particular, we utilize a statistical technique called importance sampling, which reweights the data to estimate risk under the policy of interest (6).
Let π e be the treatment policy we wish to evaluate. Similarly, we define π b as the baseline treatment policy under Eq. 2, which the data was collected. Then, for each patient i and each time t, we can define the importance weight: where (a is the action-state pair for patient i at time j Estimation of π b : To calculate the importance weights, it is necessary to access both π e and π b for all possible actions and states. Since π e is the hypothetical evaluation policy, for every possible state s we know the probability of a treatment decision π e (s) . For π b , however, these probabilities must be estimated from the data.
Since the treatment decision is binary, estimating the baseline treatment policy π b can be done by fitting a probabilistic classifier to the actions and states and using the posterior probabilities of an action a (output of the classifier) given the state s (input to the classifier). Logistic regression was chosen for this task as it has worked well and was a reasonable choice. To account for the fact that DAA treatment regimens are fixed and would not change with implementing a new treatment policy, the learning of π b was split into two phases: pre-treatment π pre b and post-treatment π post b . All action-state pairs from all patients that occurred before treatment or were on the treatment decision day were used for the pre-treatment phase.
For the post-treatment phase, we set π post b (a|s) = π post e (a|s) for all hypothetical treatment policies π e and for all possible treatment decisions and states. Note that, regardless of the estimated post-treatment policyπ post b (a|s) , the importance weight ρ (i) for any t ≥ t * , where t * is the initial treatment time; it is, therefore, unnecessary to estimate π post b (a|s) directly. After estimating the baseline policy π b and calculating the importance weights for each patient, a weighted estimator can then assess the overall average risk of the new treatment policies. For off-policy evaluation, we use a variant of the per-horizon weighted importance sampler found in Doroudi et al. & Raghu et al. [19,21] and shown in Eq. 3.
(1) γ is a discount parameter, L is the set of different lengths of action-state-reward triplets, W l is the fraction of l-length triplets in the dataset, and finally ρ T i is the importance weight. This estimator allows for different length patient trajectories, which is critical in our case.
We utilize a simple bootstrap sample of 100 samples each and report the mean and 90% confidence interval of the overall risk estimate for each hypothetical treatment policy, defined in the next section.

Construction and Evaluation of Hypothetical Clinical Policies
We constructed four hypothetical treatment policies to evaluate to demonstrate the off-policy evaluation method. The first two treatment policies were based on the AST to Platelet Ratio Index (APRI). APRI is a validated predictor of hepatic fibrosis in chronic HCV routinely used in clinical care. Past work has used two APRI scores greater than 2.25 as a surrogate outcome for cirrhosis [16].
Policy 1-Piecewise Treatment Policy: The first treatment policy is a piecewise function (Fig. 2a), where treatment probability increases with APRI score (Eq. 4):

Policy 2-Logistic Regression (APRI only):
The second treatment policy is a data-driven treatment policy using logistic regression with APRI as a single feature. The outcome is a positive diagnosis of cirrhosis (Fig. 2b). The data points used to fit the logistic regression are the last APRI score recorded per patient before a diagnosis of cirrhosis (for positive outcome), or the last APRI score recorded for the patient (for negative outcome), taken from an expanded dataset where the outcome of cirrhosis was identified using liver elastography. The class probabilities are then used as treatment probabilities for the policy.

Policy 3-Logistic Regression (All Variables):
The third treatment policy is a logistic regression on all available state variables, with the same process as the previous treatment policy, and fit on the same expanded dataset as described above.
Policy 4-Risk-Based Policy: The final treatment policy is based on the risk measure used as the evaluator of patient risk. This policy is included to demonstrate the utility of incorporating each patient's risk in the treatment decision. In particular, the probability of treatment π e (s) is based on a logistic function, i.e., 1+e k(r(s)−r 0 ) , where π e (s) is the probability of treatment given the state and r(s) is the calculated risk given the current state. For the policy in the results with the reported risk, r 0 was set to r 0 = 0.003 , and k was set to 1000. These parameters were chosen as they were local minimal for the risk through a grid search. An investigation of how risk behaves as these parameters change is also explored in the results.

Evaluation of treatment strategies using monte Carlo simulation
In addition to evaluating the estimated risk of each hypothetical policy, we also used a Monte Carlo simulation on the data to find the number of patients that the policies would treat. To avoid counterfactual evaluations, the Monte Carlo simulation is only performed on measurement time points for which the treatment has not started or has started at that time point. For each hypothetical treatment policy, we record the average number of patients treated, as well as the number of time points that were not treated. The latter can be thought of as one of many surrogates for cost savings, as delayed treatment implies a delayed expenditure for the hospital. Patients were separated into the low, medium, and high-risk categories by their maximum risk score overall measurement time points. Patients with a risk score above the 90th percentile (r = 0.015) were considered high-risk patients. Patients between the 90th percentile and 50th percentile (r = 0.0029) were considered medium risk, and those in the bottom half of the maximum risk scores were considered low risk. There were 380 high-risk patients, 1530 medium-risk patients, and 1882 low-risk patients used by the Monte Carlo simulation.  Figure 3b shows the associated drop in the risk of developing cirrhosis. Figure 3c shows an example of an untreated patient. Note that the risk of cirrhosis drops significantly as treatment starts; even before SVR is achieved, many patients' longitudinal measurements (including APRI) improve rapidly after the start of treatment. This is evident in Fig. 4a, which compares the risk scores of treated vs. untreated patients; the median risk of treated patients is 69% lower than untreated patients (median 0.0007 vs. 0.0022, respectively). Note that the absolute scale of the risk is not important, rather the relative scale. Figure 4b shows the median risk as a function of the time after the treatment start date. As expected, the risk of developing cirrhosis decreases after treatment and continues to decrease after the end of DAA treatment. Table 1 shows the results from the off-policy evaluation of the proposed policies. The risk-based policy has the lowest average risk of all the policies tested, with a bootstrap 90% confidence interval of (0.016, 0.019), with the full state logistic regression policy following behind. Policy 1 and Policy 2 are statistically nearly identical, as their treatment probability curves were shown to be similar in Interestingly, although the risk-based policy has the lowest average risk of the tested policies, it also treats the least low-risk patients and the highest and mediumrisk patients, showing that it can prioritize patients based on estimated risk. The nearest competitor, Policy 3, also has significantly fewer untreated time points, signaling that the risk-based policy is better at finding higher-risk patients at a lower cost to the health system. Figure 5 shows the sensitivity of the risk-based policy to r 0 and k . As r 0 increases, risk also increases for a fixed k (Fig. 5a), and as k increases, r 0 decreases but stabilizes for a fixed r 0 (Fig. 5b). Note that the Monte Carlo number of patients treated increases as risk decreases, and the number of untreated time points decreases with risk, as expected.

Discussion
This study aimed to demonstrate the effectiveness of a reinforcement learning framework for the evaluation of hypothetical treatment policies using historical data. Through the case study of HCV treatment, we demonstrated that off-policy evaluation could be helpful to compare different intervention strategies in advance. Using this approach, we found that a risk-based policy had the best estimated average risk score and thus best prioritized treating medium-and high-risk patients over low-risk patients compared to other hypothetical treatment policies. The clinical implication in implementing a risk-based policy is more efficient treatment allocation. Therefore, this approach could have helped prioritize short-term access and costs but also could have helped save dollars to avert downstream cirrhosis-related management and complications for large health systems.
This type of evaluation approach can be helpful for other disease states where similar policy comparisons are needed and can be used as a generalized methodological tool for evaluating treatment strategies without waiting for outcomes from a clinical study. For example, this can be applied to other diseases with high-cost conditions such as cancers [23].
These results also imply that, under financial limitations or treatment scarcity, systematic treatment policies could improve average patient outcomes, as measured by a reduced risk in cirrhosis. While outside of the scope of this paper, we believe that more complicated policies could further improve the average outcome without requiring immediate treatment of all enrolled patients. Indeed, for HCV, more complex classification models could also be extended to create new treatment policies.
While the technique of off-policy evaluation is only applied in this study to treatment policies for HCV, the methodology is applicable in many different scenarios.
The key ingredients necessary for policy evaluation here are a predefined risk/reward measurement or model, a well-defined action space, and a series of states. The careful construction of these components is essential, as different design decisions regarding the construction of these variables will significantly affect the overall outcome. Other work has considered these techniques in the cases of sepsis treatment [21], and more applications of this type are sure to follow. Although simulation modeling can be used to evaluate theoretical decision paths in advance, they are based on assumptions that may fail to provide unbiased evaluations of the hypothetical treatment policy [7]. Future studies will focus on the clinical validation of these results.

Limitations
This approach has several important limitations; first, the methodology assumes that the treatment action is based solely on the state at that particular time and not the past trajectory of the patient. This is not how clinicians operate in practice, although we find it a reasonable Fig. 4 Analysis of risk scores. a Comparison of risk scores in the dataset separated between treated and untreated measurement timepoints. As expected, the untreated timepoints have a higher risk score on average. b Median risk (with 50% percentile interval) striated across amount of time after treatment start date. As expected, risk continues to decrease after initial treatment approximation here. The second limitation is that missing data were imputed by using the last observation carried forward. Other potential imputation methods could have been utilized that produce less bias in the final policy estimates. Both the baseline and each of the hypothetical treatment policies are assumed to be randomized, i.e., the treatment decisions made at a particular time point are stochastic. This can prove troublesome in implementing these policies in a real-world environment, where hard treatment thresholds are usually favored. Some adjustments can allow for deterministic policies, but this generally requires more data to represent the joint state-action space properly; we leave this to future work. Another important limitation of this method is that although we reported treatment quantity, this work does not explicitly allow for a budget constraint on the amount of treatment given, either in total or over time. An important additional limitation in this work concerns the robustness and generalizability of the technique. From a methodological perspective, a more sophisticated means of determining the parameters would be better to ensure the optimality of the policy. A final limitation of this study is that patients treated with DAAs but do not achieve SVR are excluded from this study. Although this is a small population of treated patients, our current technique cannot account for this outcome. In addition, we note that a more generalizable and externally validated risk model would give the method presented in this paper more empirical credibility to inform policy decisions.

Conclusion
New hypothetical treatment policies for HCV were evaluated using a reinforcement learning framework using historical data collected from the VHA. A riskbased policy was shown to prioritize high and mediumrisk patients more effectively while reducing the cost to healthcare systems. The methodology used in this study could be of interest for a better understanding of treatment policies for other diseases.

Fig. 5
Risk sensitivity for k and r 0 . Sensitivity of the risk-based policy to r 0 andk . As r 0 increases, risk also increases for a fixed k(a), and as k increases, r 0 decreases but stabilizes for a fixed r 0 (b)